Abstract. In this paper, we study different discrete data clustering
methods that use the Model-Based Clustering (MBC) framework with
the Multinomial distribution. Our study covers several relevant issues,
such as initialization, model estimation and model selection. Additionally,
we propose a novel MBC method that efficiently combines the
partitional and hierarchical clustering techniques. We conduct experiments
on both synthetic and real data and evaluate the methods using
accuracy, stability and computation time. Our study identifies appropriate
strategies to be used for discrete data analysis with the MBC methods.
Moreover, our proposed method is very competitive w.r.t. clustering
accuracy and better w.r.t. stability and computation time.

Keywords: Multinomial Distribution, Model-Based Clustering. 


1 Introduction 

Model-Based Clustering (MBC) estimates the parameters of a statistical model
for the data and produces a probabilistic clustering [6, 7, 15, 19]. To use an MBC
method for clustering data as well as automatically selecting K (the number of
clusters), it is necessary to generate a set of candidate models. A simple approach
is to estimate each model separately using an Expectation-Maximization
(EM) method [13] with $K = 1, \dots, K_{\max}$. However, this can be
computationally inefficient for higher dimensional data and larger values of $K_{\max}$.

Figueiredo and Jain [5] proposed an MBC method that integrates both the model
estimation and selection tasks within a single EM algorithm. A different strategy,
called hybrid MBC [19], generates a hierarchy of models from $K_{\max}$ clusters by
merging the parameters. Such an approach naturally saves computation
time as it does not explicitly learn the $K = K_{\max} - 1, \dots, 1$ component models from
the data. In this paper, we propose a hybrid MBC method with the Multinomial
Mixture (MM) model and then empirically compare it with other MBC methods.
Moreover, we explicitly address two related issues: (1) initialization [3]: how to
set the initial parameters for the EM method and (2) model selection [2]: which
criterion to use for selecting the best model. Therefore, based on an empirical
study, we aim to answer the following questions: (a) which method should be
used for initialization? (b) how to efficiently generate a set of models? (c) what
is the difference between “learning from data” and “estimating from the $K_{\max}$ model
parameters”? and (d) what is the best model selection method?

Our overall contribution is a comparative study of different
MBC methods with the MM. Individually, we: (1) propose (Sec. 3.6) a novel
MBC method and compare it with the state-of-the-art methods; (2) perform
an empirical study of different initialization methods (Sec. 3.3) and (3) compare
different model selection methods (Sec. 3.5). We conduct experiments with synthetic
and real text data (for document clustering [19]) and identify particular
methods that should be used for initialization, candidate model estimation and
model selection. These contributions and experiments
answer the questions raised at the end of the previous paragraph.

In the remaining part of this paper, we study the background and related 
work in Sec. 2, discuss different methods in Sec. 3, present the experimental 
results with discussion in Sec. 4 and finally draw conclusions in Sec. 5. 


2 Background and Related Work 

Model-Based Clustering (MBC) [6, 15] is a well-established method for cluster
analysis and unsupervised learning. MBC assumes a probabilistic model (e.g.,
mixture model) for the data and then estimates the model parameters by optimizing
an objective function (e.g., model likelihood). The Expectation Maximization
(EM) method [13] is mostly used in MBC to estimate the model parameters.
EM consists of an Expectation step (E-step) and a Maximization step (M-step)
which are iteratively employed to maximize the log likelihood of the data.

MBC methods have been exploited with the Gaussian distribution to analyze 
continuous data [6, 15, 5, 2, 7]. Besides, they have been proposed to analyze 
discrete data using the Multinomial distribution [14, 17] and directional data 
using the directional distributions [1, 9, 10]. In this paper, we only study and 
compare the MBC methods with the Multinomial distribution. 

The Multinomial Mixture (MM) is a statistical model which has been used 
for cluster analysis with discrete data [14, 20, 17]. Meila and Heckerman [14] 
studied the MBC methods with MM and compared them w.r.t. accuracy, time 
and number of clusters. They found that the EM method significantly outperforms
the others, which motivates us to focus solely on the EM related approaches.

Initialization of the EM method has a significant impact on the clustering
results [13, 3, 12], because with different initializations it may converge to different
values of the likelihood function, some of which can be local maxima, i.e.,
sub-optimal results. To overcome this, several initialization strategies have been
proposed, see [3] for details. Meila and Heckerman [14] investigated three initialization
strategies for the EM with MM. In this paper, we consider their
observations [14] and empirically evaluate additional initialization methods for
the EM method which were discussed by Biernacki et al. [3].

In order to automatically select K (the number of components), an MBC method
can be used by first generating a set of candidate models with different values of
K and then selecting the optimal model using a model selection criterion [6, 15].
This strategy needs to address two issues: (a) how to generate the models? and
(b) how to select the best model? This paper considers both of these issues.
In particular, we focus on the candidate model generation task and propose a
novel solution based on the Hybrid MBC (HMBC) [19] method.

The HMBC method is a two-stage model that exploits both partitional and hierarchical
clustering. It begins with a partitional clustering with $K_{\max}$ clusters
and then uses Hierarchical Agglomerative Clustering (HAC) on those cluster
parameters to generate a hierarchy of mixture models. It differs from
Model-Based Hierarchical Clustering (MBHC), which employs the HAC on
each data point [6]. In practice, for a large number of samples, such an MBHC
method is inefficient w.r.t. the required time and memory [19]. Several HMBC
methods have been proposed with different probability distributions, see [19], [8],
[10] and [18]. Among these, [18] proposed a method in the context of Bayesian
analysis. However, it requires an explicit analysis of the features, which can
be computationally inefficient for higher dimensional data. An efficient mixture
model simplification/fusion method was recently proposed in [8] for the Gaussian
distribution and in [10, 9] for the directional distributions. These methods use information
divergences among the mixture models. In this paper, we follow a similar
approach and propose a novel HMBC method with the MM.

Model selection is one of the most prominent issues in cluster analysis [15, 
5, 2, 7]. In general, a statistical model selection criterion is often used with the 
MBC method, which is also called the parsimony-based approach [15]. See [5] 
for a list of different criteria. A different approach performs model selection by 
analyzing an evaluation graph, see [16] for such a method called the L-method. 
To select a model with MM, [14] uses the likelihood value. Recently, [17] proposed
the Minimum Message Length (MML) criterion for the MM. In this paper, we 
aim to present a comparative study among these methods. 

This paper is similar to two previous works, [14] and [17]. However,
the key differences are that it: (1) proposes a novel method to efficiently generate
candidate models; (2) investigates additional initialization methods proposed in
[3] and (3) explores a wide range of model selection methods.

3 Methodologies 

In the following sub-sections, first we present the model for the data, then discuss 
the relevant algorithms and finally propose a complete clustering method. 

3.1 Multinomial Mixture Model 

Let $x_i = (x_{i,1}, \dots, x_{i,D})$ be a D dimensional discrete count vector of order V,
i.e., $\sum_{d=1}^{D} x_{i,d} = V$. Moreover, $x_i$ is assumed to be an independent realization of
the random variable X, which follows a V-order Multinomial distribution [4]:

$$\mathcal{M}(x_i \mid V, \mu) = \binom{V}{x_{i,1}, x_{i,2}, \dots, x_{i,D}} \prod_{d=1}^{D} \mu_d^{x_{i,d}} \tag{1}$$

Here, $\mu$ is the D dimensional parameter with $0 < \mu_d < 1$ and $\sum_{d=1}^{D} \mu_d = 1$. The
set of samples can be modeled with a Multinomial Mixture (MM) model of K
components:

$$f(x_i \mid \Theta_K) = \sum_{k=1}^{K} \pi_k \, \mathcal{M}(x_i \mid V, \mu_k) \tag{2}$$

In Eq. (2), $\Theta_K = \{(\pi_1, \mu_1), \dots, (\pi_K, \mu_K)\}$ is the set of model parameters,
$\pi_k$ is the mixing proportion with $\sum_{k=1}^{K} \pi_k = 1$ and $\mathcal{M}(x_i \mid V, \mu_k)$ is the density
function (Eq. (1)) associated with the k-th cluster.
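
As a small illustration of Eqs. (1) and (2), the following sketch evaluates the Multinomial probability in log space and the mixture density for a single count vector. The function names are hypothetical and chosen only for this example; the order V is taken to be the sum of the counts.

```python
import math

def multinomial_log_pmf(x, mu):
    """Log of Eq. (1): log M(x | V, mu), with V taken as sum(x)."""
    v = sum(x)
    # log of the multinomial coefficient V! / (x_1! ... x_D!)
    log_coef = math.lgamma(v + 1) - sum(math.lgamma(c + 1) for c in x)
    # zero counts contribute nothing, so they are skipped
    return log_coef + sum(c * math.log(m) for c, m in zip(x, mu) if c > 0)

def mixture_density(x, pis, mus):
    """Eq. (2): f(x | Theta_K) = sum_k pi_k * M(x | V, mu_k)."""
    return sum(p * math.exp(multinomial_log_pmf(x, m))
               for p, m in zip(pis, mus))
```

For example, `mixture_density([2, 1], [0.3, 0.7], [[0.2, 0.8], [0.6, 0.4]])` evaluates a two-component mixture on a count vector of order 3.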

3.2 Expectation Maximization Method 

To cluster data with the model (Eq. (2)), we estimate its parameters using an 
Expectation Maximization (EM) [13] method that maximizes the log-likelihood: 

$$L(\Theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{M}(x_i \mid V, \mu_k) \tag{3}$$

where N is the number of samples. In the Expectation step (E-step), we compute 
posterior probability as: 


$$\rho_{i,k} = p(z_i = k \mid x_i) = \frac{\pi_k \prod_{d=1}^{D} \mu_{k,d}^{x_{i,d}}}{\sum_{l=1}^{K} \pi_l \prod_{d=1}^{D} \mu_{l,d}^{x_{i,d}}} \tag{4}$$

where $z_i \in \{0,1\}^K$ denotes the cluster label of the i-th sample. In the Maximization
step (M-step), we update $\pi_k$ and $\mu_{k,d}$ as:


$$\pi_k = \frac{1}{N} \sum_{i=1}^{N} \rho_{i,k} \quad \text{and} \quad \mu_{k,d} = \frac{\sum_{i=1}^{N} \rho_{i,k} \, x_{i,d}}{\sum_{i=1}^{N} \sum_{r=1}^{D} \rho_{i,k} \, x_{i,r}} \tag{5}$$

The E and M steps run iteratively until a convergence criterion (e.g., the difference
of log-likelihood) is met or until a maximum number of iterations is reached.
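
The E and M steps can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the smoothing constant `eps` (which keeps every parameter strictly positive) and the default stopping values are assumptions. The returned log-likelihood omits the multinomial coefficients, which are constant w.r.t. the parameters and cancel in Eq. (4).

```python
import math

def e_step(X, pis, mus):
    """Eq. (4): posteriors rho[i][k] and the log-likelihood up to a constant."""
    rho, loglik = [], 0.0
    for x in X:
        logs = [math.log(p) + sum(c * math.log(m) for c, m in zip(x, mu))
                for p, mu in zip(pis, mus)]
        top = max(logs)                    # log-sum-exp for numerical stability
        w = [math.exp(v - top) for v in logs]
        s = sum(w)
        rho.append([v / s for v in w])
        loglik += top + math.log(s)
    return rho, loglik

def m_step(X, rho, eps=1e-10):
    """Eq. (5), with a tiny additive constant eps (an assumption of this sketch)."""
    n, K, D = len(X), len(rho[0]), len(X[0])
    pis, mus = [], []
    for k in range(K):
        pis.append((sum(r[k] for r in rho) + eps) / (n + K * eps))
        num = [sum(rho[i][k] * X[i][d] for i in range(n)) + eps
               for d in range(D)]
        tot = sum(num)
        mus.append([v / tot for v in num])
    return pis, mus

def em(X, pis, mus, max_iter=100, tol=1e-5):
    """Alternate E and M steps until the log-likelihood change is below tol."""
    prev = -math.inf
    for _ in range(max_iter):
        rho, loglik = e_step(X, pis, mus)
        pis, mus = m_step(X, rho)
        if abs(loglik - prev) < tol:
            break
        prev = loglik
    return pis, mus, loglik
```

Working in log space inside the E-step avoids underflow of the products in Eq. (4) when the order V or the dimension D is large.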


3.3 Initialization for the EM Method 

The EM method requires the initial values of the parameters as an input. We 
examine the following five methods to initialize the EM: 

- Random: set the initial values randomly with $0 < \mu_d < 1$ and $\sum_{d=1}^{D} \mu_d = 1$.

- rndEM [12]: run a large number of random starts and select the one which
provides the maximum likelihood value (Eq. (3)).

- Small EM (smEM) [3]: run multiple short runs of randomly initialized
EM and choose the one with the maximum likelihood value. Here, a short run
means that we do not wait until convergence and stop the algorithm when a limited
number of EM iterations is completed.

- Classification EM (CEM) [3]: it is similar to smEM, except that a classification
step is inserted between the E and M steps. The classification
step involves assigning each point to one of the K components using the
conditional probabilities (Eq. (4)) computed in the E step.

- Stochastic EM (SEM) [3]: it is similar to smEM, except that a stochastic
step is inserted between the E and M steps. The stochastic step assigns $x_i$ at
random to one of the K mixture components according to the Multinomial
distribution with the conditional probabilities (Eq. (4)).
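
The extra steps that distinguish CEM and SEM from a plain EM iteration can be sketched as below. Both functions take the posterior vector of one sample (one row of Eq. (4)) and return a one-hot assignment that would replace it before the M-step. The function names and the `rng` argument are hypothetical.

```python
import random

def classify(rho_i):
    """CEM classification step: hard-assign to the most probable component."""
    k_best = max(range(len(rho_i)), key=lambda k: rho_i[k])
    return [1.0 if k == k_best else 0.0 for k in range(len(rho_i))]

def stochastic_assign(rho_i, rng=random):
    """SEM stochastic step: draw one component with probabilities rho_i."""
    k_draw = rng.choices(range(len(rho_i)), weights=rho_i, k=1)[0]
    return [1.0 if k == k_draw else 0.0 for k in range(len(rho_i))]
```

Passing a seeded `random.Random` instance as `rng` makes the stochastic step reproducible.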

3.4 Candidate Models Generation 

Multiple EM (Mul-EM): This is the simplest way to generate the candidate
models. In this approach, the EM method is run $K_{\max}$ times to generate the
candidate models with $K = 1, \dots, K_{\max}$ clusters.

Integrated-EM (Int-EM): This approach [5, 17] does not explicitly generate
the candidate models. Instead, it employs a single EM method that estimates the
MM with K clusters and evaluates it at the same time. It begins with $K = K_{\max}$
clusters and estimates the parameters. Then it annihilates the cluster with the minimum
$\pi_k$ and estimates the parameters with $K - 1$ clusters. This process continues within a
single EM method until K = 1. See the EM-MML algorithm of [17] for details.

EM followed by Hierarchical Agglomerative Clustering (EM-HAC):

This is our proposed model generation method, whose aim is to generate a hierarchy
of Multinomial Mixture (MM) models. To this end, we apply Hierarchical
Agglomerative Clustering (HAC) to the mixture model parameters $\Theta_K$.
In general, the HAC permits a variety of choices based on three principal issues:
(a) the dissimilarity measure between clusters; (b) the criterion to select the
clusters to be merged and (c) the representation of the merged cluster.

We use the symmetric Kullback-Leibler Divergence [4] (sKLD) as a measure
of the dissimilarity between two Multinomial distributions as:

$$\mathrm{sKLD}(\mu_a, \mu_b) = \frac{D_{KL}(\mu_a, \mu_b) + D_{KL}(\mu_b, \mu_a)}{2} \tag{6}$$

We choose “minimum sKLD” as the merging criterion (issue (b)). In addition, we
use the “complete linkage” criterion, which was determined empirically.

In this clustering strategy, the set of models is represented by their parameters.
After determining the clusters to be merged, similar to [8, 10], we compute
the merged cluster parameters (issue (c)) as:

$$\pi_{merged} = \sum_{l \in \Theta_{sub}} \pi_l \quad \text{and} \quad \mu_{merged} = \frac{\sum_{l \in \Theta_{sub}} \pi_l \, \mu_l}{\sum_{l \in \Theta_{sub}} \pi_l} \tag{7}$$

where $\Theta_{sub} \subset \Theta_{K_{\max}}$. As an outcome, we obtain a set of MMs with different K,
which will be explored further for model selection.
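
The hierarchy generation described above can be sketched as follows. This is only an illustrative reading of the procedure: the function names are hypothetical, the sKLD is computed directly on the parameter vectors (assuming all entries are strictly positive), and complete linkage is taken to mean that the distance between two groups of components is the largest pairwise sKLD between their members.

```python
import math
from itertools import combinations

def skld(mu_a, mu_b):
    """Eq. (6) on two parameter vectors (all entries assumed > 0)."""
    kl_ab = sum(a * math.log(a / b) for a, b in zip(mu_a, mu_b))
    kl_ba = sum(b * math.log(b / a) for a, b in zip(mu_a, mu_b))
    return 0.5 * (kl_ab + kl_ba)

def merge(pis, mus, members):
    """Eq. (7): mixing proportion and parameters of a merged group."""
    pi_m = sum(pis[l] for l in members)
    mu_m = [sum(pis[l] * mus[l][d] for l in members) / pi_m
            for d in range(len(mus[0]))]
    return pi_m, mu_m

def em_hac_hierarchy(pis, mus):
    """Return {K: (pis_K, mus_K)} for K = K_max, ..., 1."""
    groups = [[k] for k in range(len(pis))]
    models = {len(groups): (list(pis), [list(m) for m in mus])}
    while len(groups) > 1:
        def dist(pair):
            # complete linkage: largest member-to-member sKLD
            return max(skld(mus[a], mus[b])
                       for a in groups[pair[0]] for b in groups[pair[1]])
        # merge the two groups with the minimum linkage distance
        i, j = min(combinations(range(len(groups)), 2), key=dist)
        groups[i] = groups[i] + groups[j]
        del groups[j]
        merged = [merge(pis, mus, g) for g in groups]
        models[len(groups)] = ([p for p, _ in merged], [m for _, m in merged])
    return models
```

Note that no data point is revisited during the merging: every model below $K_{\max}$ is computed from the $K_{\max}$ component parameters alone, which is where the time saving of the hybrid approach comes from.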
3.5 Model Selection


Consider that, after HAC, we have a set of MMs with $K_{\max}, \dots, 1$ components.
The task of model selection can be defined as selecting the mixture model
with $K_o$ components, i.e., $\Theta_{K_o} = \{(\pi_1, \mu_1), \dots, (\pi_{K_o}, \mu_{K_o})\}$. We consider
parsimony-based [15] and evaluation graph based [16] methods in this work.

In the parsimony-based method [15], an objective function is employed, which
minimizes a certain model selection criterion. Such criteria involve the negative log
likelihood augmented by a penalizing function in order to take into account
the complexity of the model. One of the most widely used criteria is the
Bayesian Information Criterion (BIC) [6]:

$$\mathrm{BIC}(K) = -2 L(\Theta) + \nu \log(N) \tag{8}$$

where $\nu = KD - 1$ is the number of free parameters of the MM, i.e., $K(D - 1)$
Multinomial parameters plus $K - 1$ mixing proportions. The Integrated
Completed Likelihood (ICL) criterion adds the mean entropy to the BIC [2]:

$$\mathrm{ICL}(K) = \mathrm{BIC}(K) - 2 \sum_{i=1}^{N} \log(p(z_i \mid x_i)) \tag{9}$$

where $p(z_i \mid x_i)$ is the conditional probability of the classified class label $z_i \in
\{1, \dots, K\}$ for the sample $x_i$. The Minimum Message Length (MML) criterion,
which has been recently proposed for MM, has the following form [17]:


$$\mathrm{MML}(K) = \frac{D}{2} \sum_{k:\pi_k > 0} \log\left(\frac{N \pi_k}{12}\right) + \frac{K_{nz}}{2} \log\left(\frac{N}{12}\right) + \frac{K_{nz}(D + 1)}{2} - L(\Theta) \tag{10}$$

where $K_{nz}$ is the number of clusters with non-zero probabilities. After computing
the values of the model selection criteria for different $K \in \{1, \dots, K_{\max}\}$, we select
$K_o$ as the one that provides the minimum value of a given criterion.
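
As an illustration of the parsimony-based selection, the sketch below computes BIC (Eq. (8)) and ICL (Eq. (9)) and picks the K with the minimum criterion value. The function names are hypothetical, and $p(z_i \mid x_i)$ is taken here to be the largest posterior of each sample, which is an assumption about how the classified label is obtained.

```python
import math

def bic(loglik, K, D, N):
    """Eq. (8) with nu = K*D - 1 free parameters."""
    nu = K * D - 1
    return -2.0 * loglik + nu * math.log(N)

def icl(loglik, K, D, N, rho):
    """Eq. (9); rho[i] holds the posteriors of sample i (Eq. (4))."""
    log_post = sum(math.log(max(r)) for r in rho)
    return bic(loglik, K, D, N) - 2.0 * log_post

def select_k(criterion_by_k):
    """Return the K whose criterion value is the minimum."""
    return min(criterion_by_k, key=criterion_by_k.get)
```

When every sample is assigned with posterior 1, the entropy term vanishes and ICL coincides with BIC, so the two criteria differ only when clusters overlap.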

For the evaluation graph based method, we consider the L-method (see [16]
for details), where the knee point is detected in the plot constructed from the
BIC values. The idea is to fit two lines at the left and right side of each point
within the range $2, \dots, K_{\max} - 1$. Finally, the point that minimizes
the total weighted root mean squared error is selected as $K_o$.
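
A minimal sketch of this knee detection is given below. It follows the general idea of fitting one least-squares line to each side of a candidate point; the function names, the requirement of at least two points per side and the weighting of each side by its share of the points are assumptions of this sketch, see [16] for the original formulation.

```python
import math

def _line_rmse(xs, ys):
    """RMSE of the least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    icpt = my - slope * mx
    return math.sqrt(sum((y - (slope * x + icpt)) ** 2
                         for x, y in zip(xs, ys)) / n)

def l_method(ks, values):
    """Return the K at the knee of the curve (ks, values), or None if too short."""
    n = len(ks)
    best_k, best_err = None, math.inf
    for c in range(2, n - 1):          # left: first c points, right: the rest
        err = ((c / n) * _line_rmse(ks[:c], values[:c])
               + ((n - c) / n) * _line_rmse(ks[c:], values[c:]))
        if err < best_err:
            best_k, best_err = ks[c - 1], err
    return best_k
```

In the setting of this section, `ks` would hold the candidate numbers of components and `values` the corresponding BIC values.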


3.6 Complete Clustering Method with MM

We propose a complete clustering method with the MM which clusters data and
selects the number of clusters automatically. It consists of the following steps:

- Step 1: Apply the EM algorithm (Sec. 3.2) to estimate the MM parameters
with $K_{\max}$ clusters, i.e., $\Theta_{K_{\max}}$.

- Step 2: Apply the HAC method (Sec. 3.4) on $\Theta_{K_{\max}}$ to generate a set of
models $\{\Theta_K\}_{K = K_{\max} - 1, \dots, 2}$.

- Step 3: Apply a model selection method (Sec. 3.5) to select $\Theta_{K_o}$, i.e., the
mixture model with the optimal number of components $K_o$.
4 Experimental Results and Discussion

We conduct experiments using both simulated and real data. For the evaluation,
we compute the Adjusted Rand Index (ARI) [11], which is a pair counting based
similarity measure between two clusterings. A high value of ARI therefore indicates
highly similar clusterings and hence high accuracy. For a dataset, we compute
the ARI between the clustering result of a particular method and the true labels.
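
For reference, the ARI can be computed from the contingency counts of two labelings as in the sketch below (the function name is hypothetical). It uses the usual pair counting form: the observed number of agreeing pairs, corrected by its expected value under random labeling.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair counting ARI between two labelings of the same samples."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))   # contingency table
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in rows.values())
    sum_b = sum(comb(c, 2) for c in cols.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:          # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The index is invariant to a permutation of the cluster labels, so the estimated labels do not need to be matched to the true ones beforehand.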

We evaluate the methods using the clustering accuracy, stability and computation
time. We run each experiment 10 times and record the average value
of the ARI as the accuracy, the standard deviation of the ARI as the stability¹ and
the average computation time.

4.1 Experimental Datasets 

Simulated Datasets: We draw a finite set of discrete count vectors $\mathcal{X} =
\{x_i\}_{i=1,\dots,N}$ from MMs with different numbers (3, 5 and 10) and types of clusters:
well-separated (ws) and not well-separated (nws). Similar to [17], the
types are verified using the sKLD² values. We consider samples of different dimensions:
3, 5, 10, 20 and 40. For each MM, we generate 100 sets of data, each
having 1000 i.i.d. samples. In the synthetic data generation process, we first construct
an MM model with K clusters. The model parameters ($\mu_k$) for each cluster
are sampled from a Dirichlet distribution. The order ($V_k$) of each cluster is sampled
randomly from the range 0.5D to 1.5D. After determining
the cluster parameters ($\mu_k$) and orders ($V_k$), we draw the data samples.
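
The generation process can be sketched as follows. Several details are assumptions made only for this example, since they are not specified above: a symmetric Dirichlet with concentration `alpha = 1`, uniform mixing proportions, integer orders drawn uniformly from the range, and the function name and `seed` argument.

```python
import math
import random
from collections import Counter

def sample_mm(n, K, D, alpha=1.0, seed=0):
    """Draw n count vectors from a randomly constructed MM with K clusters."""
    rng = random.Random(seed)
    mus = []
    for _ in range(K):
        # Dirichlet draw via normalized Gamma variates
        g = [rng.gammavariate(alpha, 1.0) for _ in range(D)]
        s = sum(g)
        mus.append([v / s for v in g])
    # one order V_k per cluster, between 0.5*D and 1.5*D
    orders = [rng.randint(math.ceil(0.5 * D), math.floor(1.5 * D))
              for _ in range(K)]
    X, labels = [], []
    for _ in range(n):
        k = rng.randrange(K)           # assumed uniform mixing proportions
        draws = Counter(rng.choices(range(D), weights=mus[k], k=orders[k]))
        X.append([draws.get(d, 0) for d in range(D)])
        labels.append(k)
    return X, labels, mus, orders
```

For example, `X, labels, mus, orders = sample_mm(1000, 3, 10)` would give one simulated set of 1000 samples with the true labels needed for the ARI.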

Real Datasets: We consider 8 text datasets used in [20]. They consist of discrete
count vectors extracted from different document collections. These datasets
were chosen because they represent a range of characteristics, such as the number
of observations (documents), the number of features (terms) and the number of
clusters. The chosen datasets are listed in Table 1. We refer the readers to
Sec. 4.2 of [20] for additional details about the construction of these datasets.

4.2 Comparisons 

First we compare the initialization strategies listed in Sec. 3.3 and consistently 
use the best one for the rest of the experiments. Afterward, we evaluate the 
model generation methods discussed in Sec. 3.4. Finally, we evaluate the model
selection strategies discussed in Sec. 3.5.

1 Stability provides a measure of robustness w.r.t. different initializations. A stable
method should provide similar results for different runs, irrespective of its initialization.
Therefore, a smaller value of the standard deviation indicates similar results
for different runs and hence higher stability of the clustering method.

2 A higher sKLD value among the cluster parameters indicates well-separated clusters,
whereas a lower value indicates less separation or a certain amount of overlap. Besides
computing the sKLD value, we also verified the separation by observing the Bayes
error rate among the clusters.
Table 1. Document text datasets for real data experiments. N denotes the number
of samples, D denotes the number of features and K denotes the number
of clusters. The sources of the datasets are - NG20: 20 Newsgroups, Classic:
ACM/CISI/CRANFIELD/MEDLINE, Ohscal: OHSUMED, K1b: WebACE, Hitech:
SJM-TREC, Reviews: SJM-TREC, Sports: SJM-TREC and La12: LAT-TREC.

     NG20   Classic  Ohscal  K1b    Hitech  Reviews  Sports  La12
N    19949  7094     11162   2340   2310    4069     8580    6279
D    43586  41681    11465   21839  10080   18483    14870   31472
K    20     4        10      6      6       5        7       6


Initialization Methods: The experimental settings for the initialization methods
(see Sec. 3.3) consist of: 1 trial for Random, 100 trials for rndEM, 5 trials
with 50 maximum EM iterations for smEM and CEM and 1 trial with 500
maximum EM iterations for SEM. The initial parameters obtained from these
methods are then passed to the EM method discussed in Sec. 3.2. Fig. 1
illustrates the results w.r.t. the clustering accuracy for both simulated³ and real
datasets. From all experimental results we have the following observations:

- For the simulated data, smEM is the best method while CEM is very
competitive. For the real data, smEM also provides the best accuracy
(except on the sports dataset), with CEM as the second choice.

- In terms of stability, smEM is the best for the simulated data and CEM is the
best for the real data.

- In terms of computation time, these methods can be ordered as follows:
Random < rndEM < CEM < smEM < SEM.

Similar to [14], we emphasize the clustering accuracy as the main criterion to
evaluate the initialization methods. Therefore, we choose the smEM method for
further experiments.


Model Generation Methods: In this experiment, we aim to generate a set
of candidate models with the methods discussed in Sec. 3.4. Among them,
Mul-EM and EM-HAC explicitly generate the models and Int-EM generates
them implicitly. All methods are initialized with the smEM method. Moreover,
the same initializations are used in Int-EM and EM-HAC. The settings of these methods
consist of: 100 as the maximum number of EM iterations, $10^{-5}$ as the convergence
threshold for the log-likelihood difference, $K_{\min} = 2$ and $K_{\max} = 15$, except
for NG20, where $K_{\max} = 30$. Fig. 2 illustrates a comparison of these methods w.r.t. the
accuracy⁴ and stability. Table 2 provides a comparison⁵ of the computation time
for real data. From all experimental results we have the following observations:

3 Due to limited space, we show results only for nws simulated samples with K = 3.

4 This computation considers that the true numbers of clusters are known. 

5 Time comparison for the synthetic data provides similar observation as real data. 
Therefore, to save space we do not present those results. 
Fig. 1. Illustration of the accuracy of the initialization methods, computed from: (a)
simulated nws samples with K = 3 and (b) real text datasets.

Fig. 2. Illustration of the clustering accuracy in (a) and (b), and stability in (c) and
(d) for the model generation methods. (a) and (c) are computed from the simulated
nws samples; (b) and (d) are computed from real text datasets.
- For the simulated data: EM-HAC and Int-EM are very competitive w.r.t.
accuracy and time (results not shown). EM-HAC is the best on stability.
Mul-EM performed worse in all but a very few experiments.

- For the real⁶ data, no single method outperforms the others w.r.t. the accuracy.
EM-HAC performs best on 3 datasets, Int-EM is best on 4 datasets and
Mul-EM is best on 1 dataset. EM-HAC is best w.r.t. the stability (7 out of
8 datasets). Most interestingly, EM-HAC shows significantly better performance
in terms of computation time, as it is ~2.5 times faster than Int-EM
and ~9 times faster than Mul-EM.

Based on the above experiments and observations, we suggest that Int-EM
is preferred when only accuracy is of concern, whereas EM-HAC is preferred
when stability and time are important besides accuracy.


Table 2. Comparison of the computation time (in seconds) among the model generation
methods.

         NG20    Classic  Ohscal  K1b   Hitech  Reviews  Sports  La12
EM-HAC   108.5   6.9      19.2    3.8   3.6     9.9      17.7    19.9
Int-EM   353.2   10.8     42.2    9.6   8.2     21.7     46.3    44.2
Mul-EM   2844.0  54.4     95.6    29.1  20.7    59.1     104.1   134.6
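
The speed-up factors quoted in the observations can be related to Table 2 with a short computation. The sketch below averages the per-dataset time ratios; this averaging is an assumption about how such factors may be obtained, and the variable and function names are hypothetical.

```python
from statistics import mean

# Computation times in seconds, copied from Table 2
# (column order: NG20, Classic, Ohscal, K1b, Hitech, Reviews, Sports, La12).
em_hac = [108.5, 6.9, 19.2, 3.8, 3.6, 9.9, 17.7, 19.9]
int_em = [353.2, 10.8, 42.2, 9.6, 8.2, 21.7, 46.3, 44.2]
mul_em = [2844.0, 54.4, 95.6, 29.1, 20.7, 59.1, 104.1, 134.6]

def mean_speedup(baseline, method):
    """Average over datasets of baseline time divided by method time."""
    return mean(b / m for b, m in zip(baseline, method))

speedup_vs_int_em = mean_speedup(int_em, em_hac)
speedup_vs_mul_em = mean_speedup(mul_em, em_hac)
print(f"{speedup_vs_int_em:.2f} {speedup_vs_mul_em:.2f}")
```

Under this averaging, the two factors come out at roughly 2.4 and 8.9, in line with the approximate values given in the text.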


Model Selection Methods: We evaluate different model selection criteria
(see Sec. 3.5) with EM-HAC. Moreover, we consider the MML with Int-EM,
also called EM-MML, as proposed in [17]. Fig. 3 illustrates a comparison with
both simulated and real data w.r.t. the rate of correct selection of the number of
components. Our observations from these results are as follows:

- For the simulated data: BIC provides the best rate (except for K = 3). ICL
is equivalent to BIC for higher K. The rate of MML decreases as K
increases. Moreover, MML performs better with EM-HAC than
with Int-EM. The L-method (LM) provides mediocre accuracy for all numbers
of clusters. The likelihood-based (LLH) criterion fails significantly.

- For the real data: LM provides a very good (~90%) rate for 4 datasets (classic,
hitech, reviews and la12). Among the other methods, MML shows
success on the reviews dataset and LLH is successful on the classic dataset.

From the above observations we conclude that the L-method (LM) is the best
choice with the proposed clustering method. However, we want to emphasize
that further research on the model selection issue is still necessary,
as no single method provides a reasonable rate for all data.


6 In this paper we are interested only in comparing different MM based MBC methods.
We refer readers to [20] for a comparison with other methods. From [20]
we observed that mixmns (Mul-EM in this paper) performs better than the
non-MBC methods, such as kmns (k-means) and skmns (spherical k-means).
Fig. 3. Illustration of the rate of correct model selection. Results in (a) are computed
from the simulated samples and results in (b) are computed from real text datasets.


5 Conclusions 

In this paper, we present a comparative study among different clustering methods
with the Multinomial Mixture models. We experimentally evaluate the related
issues, such as initialization, model estimation and generation, and model
selection. Besides, we propose a novel method for efficiently estimating the candidate
models. Experimental results on both simulated and real data show that:
(a) small run of EM (smEM) is the best choice for initialization; (b) the proposed
hybrid model-based clustering, called EM-HAC, is the best choice for candidate
model estimation and (c) the L-method is the best choice for model selection. As
future work, we foresee the necessity to conduct further research on the model
selection issue. Moreover, it is also necessary to evaluate these methods on more
real-world discrete datasets obtained from a variety of different contexts.